# Upgrade the libraries
#!pip install pandas --upgrade
#!pip install numpy -- upgrade
#!pip install matplotlib --upgrade
#!pip install seaborn --upgrade
#!pip install statsmodels - upgrade
#!pip install --upgrade plotly
#!pip install --upgrade bokeh
# The last time the libraries were upgraded was on 04/12/2023
#!pip install tabulate
#!pip install ipywidgets
#!pip install yfinance
#!pip install pandas matplotlib statsmodels pmdarima
# The last installations were on 13/12/2023
# Import the necessary Libraries
import pandas as pd # Used to provide data structures like DataFrame for efficient data manipulation and analysis.
import numpy as np # Used for numerical operations on arrays and matrices.
import matplotlib.pyplot as plt # Used to create static, animated, and interactive visualizations in Python.
import seaborn as sns # Used for statistical data visualization.
import statsmodels.api as sm # Used for linear regression.
import plotly.graph_objects as go # Importing Plotly with the alias 'go' for creating visualizations
from bokeh.plotting import figure, show # Importing Bokeh for interactive visualizations
import json # to convert Json file to csv
import csv # to read CSV files
from tabulate import tabulate
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
| Author | Student Number | Course and Project | Professor |
|---|---|---|---|
| Adriana Moreira | D22125858 | Working With Data - Time Series | Lucas Rizzo |
Objective for Analyzing Bitcoin & Cryptocurrency Datasets:¶
The primary objective of analyzing the Bitcoin Historical Data and Cryptocurrency Historical Prices Dataset is to gain insights into the trading dynamics and historical performance of Bitcoin and various cryptocurrencies. The datasets offer a comprehensive view of the market activities, including price movements, trading volumes, and other relevant metrics. The analysis aims to answer several key questions to understand the trends, behavior, and challenges in the cryptocurrency mak.et

Data Understanding¶
The Datasets (Bitcoin & Cryptocurrency)¶
This studies will analyse and combaine 2 datasets Bitcoin and Cryptocurrency. The Bitcoin dataset covers Bitcoin trends per second, while the Cryptocurrency dataset cover trends of all crypto currence in the world daily (using the mean calculation). It is important to address that both datasets has their date set in a America style MM/DD/YYYY.
Bitcoin Historical Data:¶
The Bitcoin dataset covers the time period from January 2012 to March 2021, providing minute-to-minute updates on the trading activity of Bitcoin. It includes data from select Bitcoin exchanges.
A Few Key Aspects of the Bitcoin Historical Dataset:¶
Data Fields:
- The dataset includes information such as OHLC (Open, High, Low, Close) prices, volume in BTC, and the indicated currency. Additionally, it provides the weighted Bitcoin price.
a. Column Description:
- open: Opening price on that particular date (Unix Time)
- high: Highest price hit on that particular date (Unix Time)
- low: Lowest price hit on that particular date (Unix Time)
- close: Closing price on that particular date (Unix Time)
- Volume_(BTC): Volume of BTC transacted in this window
- Volume_(Currency): Volume of corresponding currency transacted in this window
- Weighted_Price: VWAP- Volume Weighted Average Price
- crypto_name: Name of the cryptocurrency
- Timestamp: Timestamps are in Unix time
Timestamps:
- The timestamps in the dataset are in Unix time format. Unix time represents the number of seconds that have elapsed since January 1, 1970 (the Unix epoch). This format is commonly used in computer systems for timestamping.
Timestamp1048575 non-nullint64
- The timestamps in the dataset are in Unix time format. Unix time represents the number of seconds that have elapsed since January 1, 1970 (the Unix epoch). This format is commonly used in computer systems for timestamping.
Data Quality and Issues:
- The dataset acknowledges that some timestamps may have missing data or NaNs (Not a Number) for timestamps without any trades or activity. These gaps could be due to technical issues such as exchange downtime, API problems, or other unforeseen errors during data reporting or gathering.
Deduplication:
- The dataset creator has made efforts to deduplicate entries and ensure the correctness and completeness of the data. However, users are advised to exercise caution and trust the data at their own risk.
Acknowledgments: The dataset creator acknowledges the contribution of Bitcoin charts for the data, various exchange APIs for providing data (despite challenges), and Satoshi Nakamoto for introducing the blockchain concept through the Bitcoin protocol.
Source: Bitcoin Historical Data on Kaggle
Cryptocurrency Historical Prices Dataset:¶
A Few Key Aspects of the Cryptocurrency Dataset:¶
The Cryptocurrency dataset consists of over 50 cryptocurrencies' historical OHLC (Open High Low Close) data. The date range is from May 2013 to October 2022 on a daily basis. The prices are represented in USD or $. The data was scraped and cleaned from the Coin Market Cap website using automated scripts.
Data Fields:
- a. Column Description:
- open: Opening price on that particular date (UTC time)
- high: Highest price hit on that particular date (UTC time)
- low: Lowest price hit on that particular date (UTC time)
- close: Closing price on that particular date (UTC time)
- volume: Quantity of asset bought or sold, displayed in the base currency
- marketCap: The total value of all the coins that have been mined. It's calculated by multiplying the number of coins in circulation by the current market price of a single coin
- timestamp: UTC timestamp of the day considered
- crypto_name: Name of the cryptocurrency
- date: Timestamp 'converted to date
- a. Column Description:
Timestamps:
- timestamp: UTC timestamp of the day considered. timestamp: 72946 non-null
object
- timestamp: UTC timestamp of the day considered. timestamp: 72946 non-null
Data Quality and Issues:
- No missing data is reported.
Source: Cryptocurrency Historical Prices Dataset on Kaggle y-historical-prices-dataset) o datee itcoin protocol. rploration.
Part 1: Importing, Cleaning, and Merging the Datasets¶
Objectives:¶
- Import the piece(s) of data (series data, audio data, or image files) and perform the appropriate cleaning and merging (if necessary) to produce the final data object(s). This may include handling missing values, noise reduction, and data transformation specific to the data type (e.g., audio normalization, image resizing, grayscaling, etc.).
Challenges:¶
Involving data in multiple formats or files.
- JSON and CSV files.
Meticulously cleaning and applying relevant transformations.
- Convert the JSON file into a CSV file.
- Convert the 'timestamp' to datetime format for easier manipulation and analysis using
pd.to_datetime. - Drop columns on the Bitcoin dataset (Volume(Currency), and Weighted_Price).
- Rename multiple columns for consistency on the Cryptocurrency dataset and on the Bitcoin dataset.
- Create a new column on the dataset Bitcoin called "date."
- Create a data frame for the converted JSON file.
- Handle missing data.
- Concat the datasets.
- Check for outliers.
Exhibits elements of creativity, e.g., problem-solving and self-learned concepts.
- to be continued.
# Import the datasets:
# Cryptocurrency_df JSON (Dataset 1)
# Load JSON file
with open('Cryptocurrency.json', 'r') as json_file:
data = json.load(json_file)
# Create DataFrame for Cryptocurrency_df (Dataset 1)
Cryptocurrency_df = pd.DataFrame(data)
# Bitcoin CSV file (Dataset 2)
# Read the CSV file into a DataFrame
pd.set_option('display.float_format', '{:.2f}'.format)
Bitcoin = pd.read_csv('Bitcoin.csv')
# Display information about the DataFrame 1 (Cryptocurrency_df)
Cryptocurrency_df.info()
# Display the first few rows of the DataFrame Cryptocurrency_df
Cryptocurrency_df.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72946 entries, 0 to 72945 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 72946 non-null int64 1 open 72946 non-null float64 2 high 72946 non-null float64 3 low 72946 non-null float64 4 close 72946 non-null float64 5 volume 72946 non-null float64 6 marketCap 72946 non-null float64 7 timestamp 72946 non-null object 8 crypto_name 72946 non-null object 9 date 72946 non-null object dtypes: float64(6), int64(1), object(3) memory usage: 5.6+ MB
| open | high | low | close | volume | marketCap | timestamp | crypto_name | date | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 112.90 | 118.80 | 107.14 | 115.91 | 0.00 | 1288693175.50 | 2013-05-05T23:59:59.999Z | Bitcoin | 2013-05-05 |
| 1 | 1 | 3.49 | 3.69 | 3.35 | 3.59 | 0.00 | 62298185.43 | 2013-05-05T23:59:59.999Z | Litecoin | 2013-05-05 |
| 2 | 2 | 115.98 | 124.66 | 106.64 | 112.30 | 0.00 | 1249023060.00 | 2013-05-06T23:59:59.999Z | Bitcoin | 2013-05-06 |
| 3 | 3 | 3.59 | 3.78 | 3.12 | 3.37 | 0.00 | 58594361.23 | 2013-05-06T23:59:59.999Z | Litecoin | 2013-05-06 |
| 4 | 4 | 112.25 | 113.44 | 97.70 | 111.50 | 0.00 | 1240593600.00 | 2013-05-07T23:59:59.999Z | Bitcoin | 2013-05-07 |
# Format and display the describe method
pd.set_option('display.float_format', '{:.2f}'.format)
Cryptocurrency_df.describe()
| open | high | low | close | volume | marketCap | ||
|---|---|---|---|---|---|---|---|
| count | 72946.00 | 72946.00 | 72946.00 | 72946.00 | 72946.00 | 72946.00 | 72946.00 |
| mean | 36472.50 | 870.19 | 896.41 | 844.06 | 871.29 | 2207607310.51 | 14749221288.69 |
| std | 21057.84 | 5231.65 | 5398.61 | 5079.39 | 5235.51 | 9617884904.43 | 75011591365.81 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 18236.25 | 0.17 | 0.18 | 0.16 | 0.17 | 8320617.59 | 186043250.00 |
| 50% | 36472.50 | 1.63 | 1.72 | 1.54 | 1.64 | 109875645.96 | 1268539252.57 |
| 75% | 54708.75 | 26.07 | 27.57 | 24.79 | 26.25 | 669139847.40 | 5118618335.98 |
| max | 72945.00 | 67549.74 | 162188.26 | 66458.72 | 67566.83 | 350967941479.06 | 1274831490851.01 |
Summary of the Cryptocurrency_df dataset:¶
float64(6): Indicates that there are six columns with the data type
float64(open, high, low, close, volume, and marketCap).int64(1): Indicates that there is 1 column with the data type
int64(Unnamed: 0).object(3): Indicates that there are 3 columns with the data type
object(timestamp, crypto_name, and date).Missing Values: No missing values.
Summary Statistics (Describe Method) for Numerical Columns (Open, High, Low, Close, Volume, MarketCap):
- The standard deviation values for
Open,High,Low, andCloseare in the range of thousands, suggesting a considerable spread of prices. - The standard deviation for the
Volumecolumn is in the billions, indicating substantial variability in trading volumes. - The
MarketCapstandard deviation is also in the billions, showing significant variability in market capitalization.
- The standard deviation values for
# Display information about the DataFrame 2 (Bitcoin)
Bitcoin.info()
# Display the first few rows of the DataFrame 2 (Bitcoin)
Bitcoin.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1048575 entries, 0 to 1048574 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Timestamp 1048575 non-null int64 1 Open 342169 non-null float64 2 High 342169 non-null float64 3 Low 342169 non-null float64 4 Close 342169 non-null float64 5 Volume_(BTC) 342169 non-null float64 6 Volume_(Currency) 342169 non-null float64 7 Weighted_Price 342169 non-null float64 8 crypto_name 1048575 non-null object dtypes: float64(7), int64(1), object(1) memory usage: 72.0+ MB
| Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | crypto_name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1325317920 | 4.39 | 4.39 | 4.39 | 4.39 | 0.46 | 2.00 | 4.39 | Bitcoin |
| 1 | 1325317980 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 2 | 1325318040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 3 | 1325318100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 4 | 1325318160 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
# Format and display the describe method
pd.set_option('display.float_format', '{:.2f}'.format)
Bitcoin.describe()
| Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | |
|---|---|---|---|---|---|---|---|---|
| count | 1048575.00 | 342169.00 | 342169.00 | 342169.00 | 342169.00 | 342169.00 | 342169.00 | 342169.00 |
| mean | 1356775140.00 | 229.46 | 229.80 | 229.06 | 229.44 | 16.27 | 4485.66 | 229.42 |
| std | 18161860.42 | 271.72 | 272.24 | 271.14 | 271.71 | 48.13 | 19512.65 | 271.66 |
| min | 1325317920.00 | 3.80 | 3.80 | 3.80 | 3.80 | 0.00 | 0.00 | 3.80 |
| 25% | 1341046530.00 | 93.49 | 93.54 | 93.42 | 93.49 | 0.99 | 99.00 | 93.48 |
| 50% | 1356775140.00 | 121.11 | 121.18 | 121.00 | 121.09 | 3.03 | 328.04 | 121.10 |
| 75% | 1372503750.00 | 194.30 | 194.46 | 194.15 | 194.30 | 12.00 | 1782.33 | 194.30 |
| max | 1388232360.00 | 1163.00 | 1163.00 | 1162.99 | 1163.00 | 2958.48 | 1543034.76 | 1163.00 |
Summary of the Bitcoin dataset:¶
float64(7): Indicates that there are seven columns with the data type
float64(Open, High, Low, Close, Volume_(BTC), Volume_(Currency), and Weighted_Price).int64(1): Indicates that there is 1 column with the data type
int64(Timestamp).object(1): Indicates that there is 1 column with the data type
object(crypto_name).Missing Values: The columns Open, High, Low, Close, Volume_(BTC), Volume_(Currency), and Weighted_Price have missing values. The total number of entries is
1048575, and the non-null count for these columns is342169, indicating missing values in these columns.Summary Statistics (Describe Method) for Numerical Columns (Open, High, Low, Close): The mean values for the numeric columns are as follows:
- Open: 229.456116
- High: 229.802477
- Low: 229.064930
- Close: 229.440478
- Volume_(BTC): 16.274528
- Volume_(Currency): 4485.655
- Weighted_Price: 229.420213
These mean values represent the average of each respective attribute in the Bitcoin dataset. The mean value is calculated as the svalues in column x e.g., the sum of all values in um of all 'Open' prices divided by the total number of 'Ope prices.
Part 1 a - Data Cleaning and Preprocessing¶
# Bitcoin dataset (datetime)
from datetime import datetime, timezone
# Convert 'Timestamp' column to datetime objects
Bitcoin['Timestamp'] = pd.to_datetime(Bitcoin['Timestamp'], unit='s', utc=True)
#converting the 'timestamp' to datetime format for easier manipulation by analysis using `pd.to_datetime`.
# Format and display the head method
pd.set_option('display.float_format', '{:.2f}'.format)
Bitcoin.head() #Check the changes made
| Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price | crypto_name | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-12-31 07:52:00+00:00 | 4.39 | 4.39 | 4.39 | 4.39 | 0.46 | 2.00 | 4.39 | Bitcoin |
| 1 | 2011-12-31 07:53:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 2 | 2011-12-31 07:54:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 3 | 2011-12-31 07:55:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
| 4 | 2011-12-31 07:56:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin |
# Bitcoin dataset (cleaning)
# Columns to drop
columns_to_drop = ['Volume_(Currency)']
# Drop specified columns and copy dataset 'Bitcoin' into a new dataset called 'Bitcoin_df'
Bitcoin_df = Bitcoin.drop(columns=columns_to_drop, axis=1)
# 'Bitcoin_df' now contains the DataFrame with the specified columns dropped
# Rename multiple columns for consistency
Bitcoin_df.rename(columns={'Weighted_Price': 'MarketCap', 'Volume_(BTC)': 'Volume'}, inplace=True)
# Bitcoin dataset (Create a new column 'Date' with the specified date format ('%Y-%m-%d') )
Bitcoin_df['Date'] = Bitcoin_df['Timestamp'].dt.strftime('%Y-%m-%d')
Bitcoin_df.head() #Check the changes made
| Timestamp | Open | High | Low | Close | Volume | MarketCap | crypto_name | Date | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-12-31 07:52:00+00:00 | 4.39 | 4.39 | 4.39 | 4.39 | 0.46 | 4.39 | Bitcoin | 2011-12-31 |
| 1 | 2011-12-31 07:53:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin | 2011-12-31 |
| 2 | 2011-12-31 07:54:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin | 2011-12-31 |
| 3 | 2011-12-31 07:55:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin | 2011-12-31 |
| 4 | 2011-12-31 07:56:00+00:00 | NaN | NaN | NaN | NaN | NaN | NaN | Bitcoin | 2011-12-31 |
# Cryptocurrency_df dataset cleaning and preprocess:
# Rename multiple columns for consistency
Cryptocurrency_df = Cryptocurrency_df.rename(columns={'timestamp': 'Timestamp', 'open': 'Open', 'high': 'High', 'low': 'Low', 'close': 'Close', 'volume': 'Volume', 'marketCap': 'MarketCap', 'date': 'Date'})
# Drop the first empty string column in place in Cryptocurrency_df column by its index
column_to_drop = Cryptocurrency_df.columns[0]
Cryptocurrency_df.drop(column_to_drop, axis=1, inplace=True)
# Convert 'Timestamp' and 'Date' columns to datetime
Cryptocurrency_df[['Timestamp', 'Date']] = Cryptocurrency_df[['Timestamp', 'Date']].apply(pd.to_datetime)
#converting the 'timestamp' and 'date' columns to datetime format for easier manipulation and analysis using `pd.to_datetime`.
Cryptocurrency_df.head() # Check the changes made
| Open | High | Low | Close | Volume | MarketCap | Timestamp | crypto_name | Date | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 112.90 | 118.80 | 107.14 | 115.91 | 0.00 | 1288693175.50 | 2013-05-05 23:59:59.999000+00:00 | Bitcoin | 2013-05-05 |
| 1 | 3.49 | 3.69 | 3.35 | 3.59 | 0.00 | 62298185.43 | 2013-05-05 23:59:59.999000+00:00 | Litecoin | 2013-05-05 |
| 2 | 115.98 | 124.66 | 106.64 | 112.30 | 0.00 | 1249023060.00 | 2013-05-06 23:59:59.999000+00:00 | Bitcoin | 2013-05-06 |
| 3 | 3.59 | 3.78 | 3.12 | 3.37 | 0.00 | 58594361.23 | 2013-05-06 23:59:59.999000+00:00 | Litecoin | 2013-05-06 |
| 4 | 112.25 | 113.44 | 97.70 | 111.50 | 0.00 | 1240593600.00 | 2013-05-07 23:59:59.999000+00:00 | Bitcoin | 2013-05-07 |
Part 1 b - Handle missing values:¶
# Check for missing values in the 'Bitcoin_df' DataFrame
missing_data = Bitcoin_df.isnull()
# Summarize the missing data: Count the number of missing values in each column
missing_data_summary = missing_data.sum()
# Display the summary: Print the number of missing values for each column
print(missing_data_summary)
Timestamp 0 Open 706406 High 706406 Low 706406 Close 706406 Volume 706406 MarketCap 706406 crypto_name 0 Date 0 dtype: int64
# Bitcoin dataset - Calculate the percentage of missing data for the columns 'Open', 'High', 'Low', 'Close', 'Volume_(BTC)', 'Volume_(Currency)' & 'Weighted_Price':
# Extract the count of missing values for each column with missing data:
columns_to_check = ['Open', 'High', 'Low', 'Close', 'Volume', 'MarketCap']
# Calculate the percentage of missing data for each specified column
missing_percentage = (Bitcoin_df[columns_to_check].isnull().sum() / len(Bitcoin_df)) * 100
# Display the results
for column, percentage in missing_percentage.items():
print(f"Percentage of missing data in the '{column}' column: {percentage:.2f}%")
Percentage of missing data in the 'Open' column: 67.37% Percentage of missing data in the 'High' column: 67.37% Percentage of missing data in the 'Low' column: 67.37% Percentage of missing data in the 'Close' column: 67.37% Percentage of missing data in the 'Volume' column: 67.37% Percentage of missing data in the 'MarketCap' column: 67.37%
# Bitcoin dataset - Drop missing values
Bitcoin_df.dropna(inplace=True)
Bitcoin_df.info()
Bitcoin_df.tail()
<class 'pandas.core.frame.DataFrame'> Index: 342169 entries, 0 to 1048574 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Timestamp 342169 non-null datetime64[ns, UTC] 1 Open 342169 non-null float64 2 High 342169 non-null float64 3 Low 342169 non-null float64 4 Close 342169 non-null float64 5 Volume 342169 non-null float64 6 MarketCap 342169 non-null float64 7 crypto_name 342169 non-null object 8 Date 342169 non-null object dtypes: datetime64[ns, UTC](1), float64(6), object(2) memory usage: 26.1+ MB
| Timestamp | Open | High | Low | Close | Volume | MarketCap | crypto_name | Date | |
|---|---|---|---|---|---|---|---|---|---|
| 1048570 | 2013-12-28 12:02:00+00:00 | 734.60 | 734.60 | 730.00 | 734.55 | 1.79 | 734.01 | Bitcoin | 2013-12-28 |
| 1048571 | 2013-12-28 12:03:00+00:00 | 734.55 | 734.55 | 730.71 | 730.71 | 0.11 | 732.99 | Bitcoin | 2013-12-28 |
| 1048572 | 2013-12-28 12:04:00+00:00 | 734.40 | 734.40 | 730.51 | 730.51 | 0.55 | 734.06 | Bitcoin | 2013-12-28 |
| 1048573 | 2013-12-28 12:05:00+00:00 | 730.51 | 733.63 | 730.51 | 731.10 | 0.62 | 731.37 | Bitcoin | 2013-12-28 |
| 1048574 | 2013-12-28 12:06:00+00:00 | 733.00 | 734.00 | 733.00 | 734.00 | 9.21 | 733.37 | Bitcoin | 2013-12-28 |
Explanation of Dropping NaN values:¶
As demonstrated in the section 'Calculate the Percentage of Missing Values', the Bitcoin_df dataset had a total of 67% missing values in its historical OHLC (Open, High, Low, Close) price data.
The rationale for dropping the NaN values was that the Cryptocurrency_df dataset calculates its historical prices using daily OHLC data for each cryptocurrency, while Bitcoin_df captures the Bitcoin OHLC trends on a second-by-second basis. For this study, daily data might be more appropriate and manageable.
Consequently, the second-by-second analysis became irrelevant for this study. The significant portion of NaN values, amounting to 67%, could lead to a biased and unreliable analysis, especially because the missing data is not randomly distributed.
After removing the missing values, the Bitcoin_df DataFrame now contains 342,169 entries with no null values across all columns
# Resample the Bitocoin_df data to daily frequency
# First, create a mapping of 'Date' to the earliest 'Timestamp' of that date
timestamp_mapping = Bitcoin_df.groupby('Date')['Timestamp'].min().reset_index()
# Perform the aggregation by the 'Date' column
Bitcoin_df_daily = Bitcoin_df.groupby('Date').agg({
'Open': 'first',
'High': 'max',
'Low': 'min',
'Close': 'last',
'Volume': 'sum',
'MarketCap': 'mean', # the mean MarketCap for the day
'crypto_name': 'first' # crypto_name is constant for each day
}).reset_index()
# Merge this mapping with the daily aggregated data
Bitcoin_df_daily = Bitcoin_df_daily.merge(timestamp_mapping, on='Date')
Bitcoin_df_daily.head() # Check the changhes made
| Date | Open | High | Low | Close | Volume | MarketCap | crypto_name | Timestamp | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-12-31 | 4.39 | 4.58 | 4.39 | 4.58 | 95.32 | 4.47 | Bitcoin | 2011-12-31 07:52:00+00:00 |
| 1 | 2012-01-01 | 4.58 | 5.00 | 4.58 | 5.00 | 21.60 | 4.81 | Bitcoin | 2012-01-01 04:16:00+00:00 |
| 2 | 2012-01-02 | 5.00 | 5.00 | 5.00 | 5.00 | 19.05 | 5.00 | Bitcoin | 2012-01-02 20:04:00+00:00 |
| 3 | 2012-01-03 | 5.32 | 5.32 | 5.14 | 5.29 | 88.04 | 5.25 | Bitcoin | 2012-01-03 11:45:00+00:00 |
| 4 | 2012-01-04 | 4.93 | 5.57 | 4.93 | 5.57 | 107.23 | 5.21 | Bitcoin | 2012-01-04 04:17:00+00:00 |
Analysis of the 'Resample the Bitcoin_df data to daily frequency'¶
Creating a Timestamp MappinIWe eaadte a mapping (timestamp_mapping) that associates each 'Date' with the earliest 'Timestamp' of that day. This is done by grouping Bitcoin_df by 'Date' and then taking the minimum 'Timestamp' for each group.
**Aggregating Daily DataIt was performed orm daily aggregation on the Bitcoin_df by grouping it by 'Date'. each groupgrthe, alculcuion the fos thewing metrics:
- 'Open': The first (earliest) value.
- 'High': The maximum value.
- 'Low': The minimum value.
- 'Close': The last (latest) value.
- 'Volume': The sum of all values.
- 'MarketCap': The mean (average) value.
- 'crypto_name': The first (should be the same for all entries of a given day).
Merging with TimestamMpinddg: We merge the aggregated daily data (Bitcoin_df_daily) with the timestamp_mapping. This step adds the earliest 'Timestamp' for each day back to the aggregated data.
The final DataFrame Bitcoin_df_daily: This includes daily aggregated OHLC data, total volume, average market cap, the cryptocurrency name, and the earliest 'Timestamp' for each day. This format is useful for analysis that requires daily granularity while also keeping trak of the exactnn time the tradig day started. day started.
Part 1 c: Concat the final process datasets: Bitcoin_df_daily and Cryptocurrency_df¶
# Concat the final processed datasets Bitcoin_df_daily and Cryptocurrency_df:
bitcoin_crypto_concat_df = pd.concat([Bitcoin_df_daily, Cryptocurrency_df], keys=['Bitcoin', 'Cryptocurrency'])
bitcoin_crypto_concat_df.tail() #Check the changes made
#print(bitcoin_crypto_concat_df.columns)
| Date | Open | High | Low | Close | Volume | MarketCap | crypto_name | Timestamp | ||
|---|---|---|---|---|---|---|---|---|---|---|
| Cryptocurrency | 72941 | 2022-10-23 00:00:00 | 0.02 | 0.02 | 0.02 | 0.02 | 40401338.08 | 1652957186.63 | VeChain | 2022-10-23 23:59:59.999000+00:00 |
| 72942 | 2022-10-23 00:00:00 | 1.47 | 1.53 | 1.44 | 1.52 | 28443510.43 | 1572824923.08 | Flow | 2022-10-23 23:59:59.999000+00:00 | |
| 72943 | 2022-10-23 00:00:00 | 4.95 | 5.15 | 4.95 | 5.12 | 106949683.62 | 1559551358.49 | Filecoin | 2022-10-23 23:59:59.999000+00:00 | |
| 72944 | 2022-10-23 00:00:00 | 0.00 | 0.00 | 0.00 | 0.00 | 214326817.63 | 1576291167.45 | Terra Classic | 2022-10-23 23:59:59.999000+00:00 | |
| 72945 | 2022-10-23 00:00:00 | 0.47 | 0.47 | 0.45 | 0.47 | 950974304.98 | 23398675901.61 | XRP | 2022-10-23 23:59:59.999000+00:00 |
Analysis of Part 1c - Concatenate the Final Processed Datasets Bitcoin_df_daily and Cryptocurrency_df¶
In this stepIwe concatenadte the datasets Bitcoin_df_daily and Cryptocurrency_df using the pd.concat() function.
Impact on Data Analysis:
Unified Dataset: Concatenation results in a sdata f DataFrame that combines data from both Bitcoin and other cryptocurrencies. This unified dataset is beneficial for comparative analysis across different cryptocurrencies.
Hierarchical Indexing: The multi-level index (created using the
keysparameter) will help distinguish between data originating from Bitcoin-specific records and other cryptocurrencies. This distinction is crucial for any analysis that compares Bitcoin against other cryptocurrencies.Consistency and Integrity: Ensure that both DataFrames (
Bitcoin_df_dailyandCryptocurrency_df) have the same structure (i.e., column names and types). Inconsistencies might lead to misleading results or errors dysis. ysis
Part 2: Exploration of the Concat DataFrame 'bitcoin_crypto_concat_df' :¶
Objectives: pixel valuethe s, frequency content of a signal changes over timthe e, harmonic content of the audio, e
- Provide a summary of basic statistics and characteristics of the imported piece/s of data, including measures of central tendency, dispersion, visualzsations where applicable, etc. Explain how these statistics inform your understanding of the data. For example, doing a visual inspection, statistics of
**Challenges:- Thoroughly and accurately provide a comprehensv summary of data statistic . Insightfully explain ow the statistics inform a eep understanding of the data.ta.tata
# Get a list of all unique cryptocurrency names
unique_crypto_names = bitcoin_crypto_concat_df['crypto_name'].unique()
# Create a DataFrame from the unique names
unique_crypto_names_df = pd.DataFrame(unique_crypto_names, columns=['crypto_name'])
# Display the DataFrame
print(unique_crypto_names_df)
crypto_name 0 Bitcoin 1 Litecoin 2 XRP 3 Dogecoin 4 Monero 5 Stellar 6 Tether 7 Ethereum 8 Ethereum Classic 9 Maker 10 Basic Attention Token 11 EOS 12 Bitcoin Cash 13 BNB 14 TRON 15 Decentraland 16 Chainlink 17 Cardano 18 Filecoin 19 Theta Network 20 Huobi Token 21 Ravencoin 22 Tezos 23 VeChain 24 Quant 25 USD Coin 26 Cronos 27 Wrapped Bitcoin 28 Cosmos 29 Polygon 30 OKB 31 UNUS SED LEO 32 Algorand 33 Chiliz 34 THORChain 35 Terra Classic 36 FTX Token 37 Hedera 38 Binance USD 39 Dai 40 Solana 41 Avalanche 42 Shiba Inu 43 The Sandbox 44 Polkadot 45 Elrond 46 Uniswap 47 Aave 48 NEAR Protocol 49 Flow 50 Internet Computer 51 Casper 52 Toncoin 53 Chain 54 ApeCoin 55 Aptos
# Display the count of times each cryptocurrency appears in the dataset 'bitcoin_crypto_concat_df':
crypto_name_counts = bitcoin_crypto_concat_df['crypto_name'].value_counts()
print(crypto_name_counts)
crypto_name Bitcoin 3977 Litecoin 3248 XRP 3157 Dogecoin 3024 Monero 2866 Stellar 2791 Tether 2582 Ethereum 2424 Ethereum Classic 2072 Basic Attention Token 1760 EOS 1730 Bitcoin Cash 1708 BNB 1706 TRON 1656 Decentraland 1652 Chainlink 1649 Cardano 1638 Maker 1605 Filecoin 1565 Theta Network 1530 Huobi Token 1513 Ravencoin 1478 Tezos 1365 VeChain 1332 Quant 1325 USD Coin 1266 Cronos 1199 Wrapped Bitcoin 1152 Cosmos 1109 Polygon 1064 OKB 1062 UNUS SED LEO 1041 Algorand 1010 Chiliz 1000 THORChain 978 Terra Classic 975 FTX Token 969 Hedera 922 Binance USD 919 Dai 857 Solana 718 Shiba Inu 608 The Sandbox 595 Polkadot 589 Elrond 575 Uniswap 562 Avalanche 559 Aave 547 NEAR Protocol 535 Flow 435 Internet Computer 337 Casper 335 Toncoin 241 Chain 82 ApeCoin 80 Aptos 1 Name: count, dtype: int64
Analysis of the 'value_counts' method:¶
- Data Completeness and Reliability: The frequency of each cryptocurrency's appearance can indicate the completeness and reliability of the data. Cryptocurrencies with more data points (higher frequency) typically provide a more reliable basis for time series analysis, as there are more data points to analyze trends, volatility, and other patterns.
- Market Presence and Maturity: The count can give insights into the market presence and maturity of each cryptocurrency. Generally, older and more established cryptocurrencies like Bitcoin or Ethereum will have more data points, reflecting their longer presence in the market.
- Trend Analysis: The distribution of data points across different cryptocurrencies can help in comparative trend analysis. For example, by comparing the price movements of Bitcoin (with 3977 data points) and a less frequent cryptocurrency like ApeCoin (80 data points), one can understand how market dynamics differ between well-established and newer cryptocurrencies.
- Historical Analysis: The distribution can also show the evolution of the cryptocurrency market over time, indicating which cryptocurrencies have been consistently present and which are newer entries.
# Statistics of the contact dataset bitcoin_crypto_concat_df:
pd.set_option('display.float_format', '{:.2f}'.format)
bitcoin_crypto_concat_df.describe()
| Open | High | Low | Close | Volume | MarketCap | |
|---|---|---|---|---|---|---|
| count | 73675.00 | 73675.00 | 73675.00 | 73675.00 | 73675.00 | 73675.00 |
| mean | 862.52 | 888.54 | 836.58 | 863.62 | 2185763535.00 | 14603280572.71 |
| std | 5206.31 | 5372.45 | 5054.78 | 5210.14 | 9572676582.05 | 74653826583.86 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.17 | 0.18 | 0.17 | 0.18 | 7659240.00 | 174113177.74 |
| 50% | 1.77 | 1.86 | 1.67 | 1.78 | 106019350.70 | 1231647176.40 |
| 75% | 26.91 | 28.22 | 25.52 | 26.98 | 652362199.25 | 5039051068.79 |
| max | 67549.74 | 162188.26 | 66458.72 | 67566.83 | 350967941479.06 | 1274831490851.01 |
Explanation of the '# Statistics of the contact dataset bitcoin_crypto_concat_df:'¶
Output Explained¶
count: Number of non-null entries in each column. All columns have 73,675 entries, indicating no missing values in these columns.
mean: The average value for each column. For example, the average 'Open' price is approximately $862.52.
min: The minimum value in each column. Notably, the minimum for 'Open', 'High', 'Low', 'Close', 'Volume', and 'MarketCap' is 0, indicating that there are entries with zero values. It could indicate either periods of extremely low activity/value for some cryptocurrencies or data collection anomalies that need to be investigated further.
25% (First Quartile): For each column, 25% of the entries are below this value. For instance, 25% of the 'Open' prices are below $0.17.
50% (Median): The median value of each column. For example, the median 'Open' price is $1.77.
75% (Third Quartile): For each column, 75% of the entries are below this value. E.g., 75% of 'Open' prices are below $26.91.
max: The maximum value in each column. For example, the maximum 'Open' price is $67,549.74.
std (Standard Deviation): Measures the amount of variation or dispersion in each column. For instance, 'Open' prices have a standard deviation of $5,206.31.
Impact on Analysis¶
Insight into Data Distribution: These statistics provide a quick overview of the distribution and range of the data. It helps in understanding the general behavior of cryptocurrency prices and market caps.
Identification of Outliers: Extreme values in 'min' and 'max' can indicate outliers. For example, a maximum 'High' price of $162,188.26 seems exceptionally high and might warrant further investigation.
Central Tendency and Dispersion: The mean, median, and standard deviation inform about the central tendency and spread of the data. For instance, a high standard deviation in 'Volume' and 'MarketCap' suggests large fluctuations in trading volume and market capitalization.
Data Integrity Check: The 'count' value helps in identifying if there are missing values that need to be addressed.
Preparation for Further Analysis: These statistics are crucial for preparing data for more complex analyses, such as time series forecasting, correlation analysis, or comparative studies between different cryptocurrencies.
In summary, the describe() function provides a comprehensive stati bitcoin_crypto_concat_dfstical summary of your dataset, serving as a fundamental step in exploratory data analysis. It helps in gaining a preliminary understanding of the data's characteristics, which is essential for guiding more detailed and specific analyses.
# Count of Unique Values
print(bitcoin_crypto_concat_df.nunique())
Date 3977 Open 72359 High 72337 Low 72387 Close 72361 Volume 72919 MarketCap 70567 crypto_name 56 Timestamp 3977 dtype: int64
Explanation of the 'Count of Unique Values' Analysis:¶
Interpretation and Use:
- High Variation in Financial Metrics: The large number of unique values in the Open, High, Low, Close, Volume, and MarketCap columns suggest a dynamic and fluctuating market, which is typical for cryptocurrencies.
- Diverse Cryptocurrency Data: With 56 unique cryptocurrencitheyour dataset offers a comprehensive view across multiple cryptocurrencies.
- Time Series Analysis Potential: Given the large number of unique dates and timestamps, this dataset is suitable for time-series analyses, such as trend analysis, forecasting, and studying seasonal pattrns. ns.
# General analysis of the concat dataset.
import matplotlib.pyplot as plt
import seaborn as sns
# Define the numerical columns to create box plots for
numerical_columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'MarketCap']
# Create a color palette with a distinct color for each numerical column
palette = sns.color_palette("Set2", len(numerical_columns))
# Create a figure with subplots in a 2x3 layout
plt.figure(figsize=(12, 8))
# Loop over the numerical columns to create a box plot for each one
for i, column in enumerate(numerical_columns):
plt.subplot(2, 3, i + 1) # i starts at 0, so add 1 for subplot indexing
sns.boxplot(x=bitcoin_crypto_concat_df[column], color=palette[i])
plt.title(f'Box Plot for {column}')
plt.xlabel(column)
plt.tight_layout()
plt.show()
Analysis of the Box plots - General analysis of the concat dataset.¶
It provides a general overview of the distribution of values in the entire dataset. It analyzes outliers for each numerical column across the entire concatenated dataset without distinguishing between different cryptocurrencies. It treats the dataset as a whole, looking at the distribution of 'Open', 'High', 'Low', 'Close', 'Volume', and 'MarketCap' across all entries.
Box Plot Explanations¶
Box Plot for Open: This plot shows the distribution of the "Open" prices. The "Open" price is the price at which a stock first trades upon the opening of an exchange on a trading day.
Box Plot for High: This visualizes the distribution of the "High" prices, which represent the highest price at which a stock traded during the course of the day.
Box Plot for Low: This displays the "Low" prices distribution, indicating the lowest price at which a stock traded during the day.
Box Plot for Close: This plot represents the distribution of "Close" prices, the price at which a stock last traded during a regular trading day.
Box Plot for Volume: This shows how the trading volume is distributed. Trading volume represents the total number of shares or contracts traded for a particular security.
Box Plot for MarketCap: This illustrates the distribution of the market capitalization values. Market capitalization is the total market value of a company's outstanding shares of stock.
For each box plot, the central tendency is indicated by the line within the box, which is the median. The box represents the middle 50% of the data, which is the interquartile range (between the 25th and 75th percentile). The whiskers extend to the rest of the distribution, except for points that are determined to be outliers, which are plotted as individual points.
Observations from the plots suggest that:
The distributions for "Open," "High," "Low," and "Close" seem to be very similar, with the median close to the lower end of the data range, indicating a skewed distribution. In practical terms, this suggests that for these financial price variables ('Open,' 'High,' 'Low,' and 'Close'), there may be occasional extremely high values (outliers) that are driving the median higher, but the majority of the data points are concentrated in the lower price range.
The "Volume" and "MarketCap" plots show a different scale, suggesting different units or magnitudes of data. The distribution of "Volume" is also skewed with the median closer to the lower end of the range.
There are outliers present in all the plots, seen as individual points beyond the whiskers.
Decision why Not handle the Outliers:
- The objective of this assignment is to provide knowledge in time series. To handle these outliers, it is important to understand the financial market, and what happened during the period the cryptocurrencies were analyzed that made the prices fluctuate. In financial datasets, what might appear as an outlier could be the result of significant market events, such as a company's breakthrough product launch, a market crash, or regulatory changes affecting the industry. In summary, some outliers are essential to the analysis because they may represent critical market movements.
# Outliers by Cryptocurrency Name (Specific Analysis):
numerical_columns = ['Open', 'High', 'Low', 'Close', 'Volume', 'MarketCap']
# Create a color palette with a distinct color for each numerical column
palette = sns.color_palette("Set2", len(numerical_columns))
cryptocurrencies = bitcoin_crypto_concat_df['crypto_name'].unique() # Adjust the column name if necessary
for crypto in cryptocurrencies:
crypto_df = bitcoin_crypto_concat_df[bitcoin_crypto_concat_df['crypto_name'] == crypto]
plt.figure(figsize=(12, 8))
for i, column in enumerate(numerical_columns):
plt.subplot(2, 3, i + 1)
sns.boxplot(x=crypto_df[column], color=palette[i]) # Use the color palette here
plt.title(f'{crypto} - Box Plot for {column}')
plt.tight_layout()
plt.show()
Analyis of the Outliers by Cryptocurrency Name (Specific Analysis):¶
For further studies, the 'Outliers by Cryptocurrency Name (Specific Analysis)' provides a detailed view that allows us to compare the financial characteristics of each cryptocurrency separately.
- Box plot for Dogecoin: The box plots for 'Open', 'High', 'Low', and 'Close' share a similar profile, these metrics are often closely related. The median of these distributions is near the lower quartile, which indicates a right-skewed distribution with a concentration of data points at the lower end of the price range. The whiskers, which represent the variability outside the upper and lower quartiles, extend a significant distance from the box, suggesting a wide range of prices. There are also numerous points beyond the whiskers, indicating the presence of outliers; these could represent periods of high volatility or price spikes.
The presence of outliers in all metrics suggests that Dogecoin, like many cryptocurrencies, is subject to periods of high volatility and rapid changes in price, volume, and market capitalization.
# Box Plot of the numeric columns:
bitcoin_crypto_concat_df.boxplot()
plt.show()
Explanation for the 'Box Plot for the numeric columns:'¶
Observations:
- 'Open', 'High', 'Low', and 'Close' represent the opening, highest, lowest, and closing prices of a cryptocurrency for certain time periods.
- 'Volume' represents the number of shares or contracts traded for the cryptocuiesrency in a day.
- 'MarketCap' is the market capitalization, which is the total market value of the cryptocurrency's circulating supply.
It appears that 'Volume' has a much wider range and larger outliers compared to other categories, suggesting that trading volume varies significantly from day to day or over the observed time periods.
'MarketCap' also has a significant range, indicating variability in the market valuation of the cryptocurrency.
The 'Open', 'High', 'Low', and 'Close' values are relatively similar in terms of their median values and range, which is common for financial time series data where these values are often close to each other within a given ime period. period.
# Heatmap - Select only numeric columns for correlation
numeric_columns = bitcoin_crypto_concat_df.select_dtypes(include=['number'])
# Compute the correlation matrix
corr_matrix = numeric_columns.corr()
# Plot the heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()
Analysis of the Heatmap:¶
Interesting Points:
Correlation Among Price Metrics: The opening (Open), high (High), low (Low), and closing (Close) prices are highly correlated with each other, as expected, since they are all related to the price movement within the same trading period. This high correlation indicates that these metrics often move in sync, reflecting consistent price behavior within each trading period.
Volume and Price Correlation: Volume does not show a strong correlation with the opening, high, low, or closing prices. This observation suggests that the trading volume might be influenced by factors not directly related to the price changes within the same period. Such factors could include external market events, investor sentiment, or other cryptocurrencies' activities.
MarketCap and Price Correlation: MarketCap shows a strong positive correlation with the opening, high, low, and closing prices. This correlation is logical because the market capitalization of a cryptocurrency is typically calculated by multiplying the price by the circulating supply. Therefore, MarketCap tends to move in tandem
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Calculating the mean, median, and mode of the bitcoin_crypto_concat_df
mean_value = bitcoin_crypto_concat_df['MarketCap'].mean()
median_value = bitcoin_crypto_concat_df['MarketCap'].median()
mode_value = bitcoin_crypto_concat_df['MarketCap'].mode()[0]
# Plotting the histogram
plt.figure(figsize=(10, 6))
plt.hist(bitcoin_crypto_concat_df['MarketCap'], bins=30, color='blue', alpha=0.7)
# Adding a vertical line for mean, median, and mode
plt.axvline(mean_value, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_value:.2f}')
plt.axvline(median_value, color='green', linestyle='dashed', linewidth=1, label=f'Median: {median_value:.2f}')
plt.axvline(mode_value, color='yellow', linestyle='dashed', linewidth=1, label=f'Mode: {mode_value:.2f}')
# Annotating the mean, median, and mode on the plot
plt.text(mean_value, plt.ylim()[1]*0.9, f'Mean: {mean_value:.2f}', color='red')
plt.text(median_value, plt.ylim()[1]*0.8, f'Median: {median_value:.2f}', color='green')
plt.text(mode_value, plt.ylim()[1]*0.7, f'Mode: {mode_value:.2f}', color='yellow')
# Setting the title and labels
plt.title('Histogram of Cryptocurrency MarketCap')
plt.xlabel('MarketCap')
plt.ylabel('Frequency')
# Show legend
plt.legend()
# Show the plot
plt.show()
Analysis of Market Capitalization Distribution¶
Right-Skewed Distribution: The histogram indicates a right-skewed (positively skewed) distribution. This is evident from the concentration of data on the left side of the histogram and the long tail extending to the right. In practical terms, this means there are a few cryptocurrencies with very high market capitalization values and a large number of cryptocurrencies with relatively low market capitalization values.
Central Tendency Measures: The mean is greater than the median, which is also a characteristic of a right-skewed distribution. Since the mean is influenced by extreme values, the higher market cap outliers push the mean to the right of the median.
Mode: The mode is indicated as zero, which suggests that the most common value for market capitalization in the dataset is 0. This could be dseveralber of cryptocurrencies that are new, inactive, or otherwise have no market capitalizataset.
Analysis Impact: In financial data like this, a right-skewed distribution is common because a small number of entities (cryptocurrencies, in this case) tend to dominate the market. However, using the mean as a measure of central tendency can be misleading because it doesn't accurately represent the typical market cap for most cryptocurrenci
es. The median is a better measure here because it is less affected by extreme outliers and gives a more realistic picture of the central tendency of the market capitalization of a "typical" cryptocurrency.
Part 3: Focus on creating meaningful visualizations¶
Objectives:
- Create meaningful visualizations (e.g., time series plots, spectrograms for audio, image histograms) to represent key features and patterns in the data or a subset of interest. Discuss the insights gained from these visuazisations. Identify and describe any notable patterns or anomalies observed ithe n analysis. Discuss their potential implications or significance within the context of your analy
Challenges:Create exceptional and insightful visualzsations effectively highlighting k y features and patter s. Provide a detailed and profound discussi n of insights gained fr m the zisualisations.tions.sis
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# Reset the current MultiIndex
bitcoin_crypto_concat_df = bitcoin_crypto_concat_df.reset_index()
# Convert the 'Date' column to datetime
bitcoin_crypto_concat_df['Date'] = pd.to_datetime(bitcoin_crypto_concat_df['Date'])
# Set 'Date' as the new index
bitcoin_crypto_concat_df.set_index('Date', inplace=True)
# Calculate the median of 'Open', 'High', 'Low', and 'Close' prices
bitcoin_crypto_concat_df['Median'] = bitcoin_crypto_concat_df[['Open', 'High', 'Low', 'Close']].median(axis=1)
# Create a figure and set its size
plt.figure(figsize=(24, 15))
# Plot the median prices
plt.plot(bitcoin_crypto_concat_df.index, bitcoin_crypto_concat_df['Median'], label='Median', color='blue')
# Define the date format
years = mdates.YearLocator() # Every year
years_fmt = mdates.DateFormatter('%Y')
ax = plt.gca()
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_fmt)
ax.xaxis.set_minor_locator(mdates.MonthLocator())
plt.xticks(rotation=45)
plt.tick_params(axis='x', which='major', labelsize=10)
plt.title('Median Cryptocurrency Prices over the Years')
plt.xlabel('Date')
plt.ylabel('Median Price')
plt.legend()
plt.tight_layout()
plt.show()
Explanation of the 'Historical Median Cryptocurrency Prices over the Years':¶
Time Span: The graph covers roughly 11 years of Crypto's price history.
Price Fluctuations: There are significant fluctuations in the median price, with several peaks and troughs. The highest peaks are seen in the right half of the chart, suggesting significant price increases during these periods.
Price Scale: The price is scaled from 0 to approximately 160,000, which suggests that the currency experienced periods of extremely high valuation.
Volatility: The thin spikes, especially the ones representing the 'High' price, indicate days with large ranges between the high and low prices, which is characteristic of high volatility in the market.
General Trend: The prices appear to rise over time with a few significant peaks, suggesting periods of rapid price increase followed by declines.
# Most Valuable Cryptocurrency
crypto_values = bitcoin_crypto_concat_df.groupby('crypto_name')['MarketCap'].last()
# Sort the values to get the most valuable cryptocurrencies
crypto_values_sorted = crypto_values.sort_values(ascending=False)
# Create a bar graph with more granular settings
plt.figure(figsize=(12, 6)) # Adjust the figure size
ax = crypto_values_sorted.plot(kind='bar')
plt.title('Most Valuable Cryptocurrency')
plt.xlabel('Cryptocurrency')
plt.ylabel('Value')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha="right")
# Limit the number of cryptocurrencies displayed (for example, top 10)
num_cryptos_to_display = 10
ax.set_xticks(range(len(crypto_values_sorted.index[:num_cryptos_to_display])))
ax.set_xticklabels(crypto_values_sorted.index[:num_cryptos_to_display])
plt.show()
#Cryptocurrency Transactions Over Time:
# Group by cryptocurrency name and year, then count the number of occurrences
counts = bitcoin_crypto_concat_df.groupby([bitcoin_crypto_concat_df.index.year, 'crypto_name']).size()
# Unstack the groupby results to create a DataFrame where each column is a cryptocurrency
counts_df = counts.unstack(level='crypto_name', fill_value=0)
# Now you have your counts_df where the index is the year and each column represents a cryptocurrency.
print(counts_df)
# Example of how to plot this data (for the first 5 cryptocurrencies for clarity in the example)
plt.figure(figsize=(12, 6))
for crypto_name in counts_df.columns[:5]: # Just plot the first 5 for a cleaner plot
plt.plot(counts_df.index, counts_df[crypto_name], label=crypto_name)
plt.title('Cryptocurrency Transactions Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Transactions')
plt.legend()
plt.grid(True)
plt.show()
crypto_name Aave Algorand ApeCoin Aptos Avalanche BNB \ Date 2011 0 0 0 0 0 0 2012 0 0 0 0 0 0 2013 0 0 0 0 0 0 2014 0 0 0 0 0 0 2015 0 0 0 0 0 0 2016 0 0 0 0 0 0 2017 0 0 0 0 0 160 2018 0 0 0 0 0 365 2019 0 193 0 0 0 364 2020 88 358 0 0 100 358 2021 323 323 0 0 323 323 2022 136 136 80 1 136 136 crypto_name Basic Attention Token Binance USD Bitcoin Bitcoin Cash ... \ Date ... 2011 0 0 1 0 ... 2012 0 0 366 0 ... 2013 0 0 603 0 ... 2014 0 0 365 0 ... 2015 0 0 365 0 ... 2016 0 0 366 0 ... 2017 214 0 365 162 ... 2018 365 0 365 365 ... 2019 364 102 364 364 ... 2020 358 358 358 358 ... 2021 323 323 323 323 ... 2022 136 136 136 136 ... crypto_name Tezos The Sandbox Theta Network Toncoin UNUS SED LEO \ Date 2011 0 0 0 0 0 2012 0 0 0 0 0 2013 0 0 0 0 0 2014 0 0 0 0 0 2015 0 0 0 0 0 2016 0 0 0 0 0 2017 0 0 0 0 0 2018 184 0 349 0 0 2019 364 0 364 0 224 2020 358 136 358 0 358 2021 323 323 323 105 323 2022 136 136 136 136 136 crypto_name USD Coin Uniswap VeChain Wrapped Bitcoin XRP Date 2011 0 0 0 0 0 2012 0 0 0 0 0 2013 0 0 0 0 150 2014 0 0 0 0 365 2015 0 0 0 0 365 2016 0 0 0 0 366 2017 0 0 0 0 365 2018 85 0 151 0 365 2019 364 0 364 335 364 2020 358 103 358 358 358 2021 323 323 323 323 323 2022 136 136 136 136 136 [12 rows x 56 columns]
Explanation of the 'Cryptocurrency Transactions Over Time' Graph¶
The 'Cryptocurrency Transactions Over Time' graph is a visual representation of the transaction counts for various cryptocurrencies across different years. It allows a comparison of the activity levels associated with each cryptocurrency and shows how they have changed over time.
X-Axis (Horizontal): This axis represents time, specifically years, over which the transaction data has been collected.
Y-Axis (Vertical): This axis shows the number of transactions that have been recorded for each cryptocurrency.
Lines: Each line represents a different cryptocurrency. The trends observed in these lines can provide insights into the transactional activity of each cryptocurrency:
A rising line, such as what we might observe for Algorand, suggests an increase in transactions during a specific period (e.g., 2018). This could be indicative of the growing popularity or utility of that particular cryptocurrency.
A falling line, such as what might be observed for Avalanche, would indicate a decrease in transactions during the observed period. This might reflect declining interest or utility.
# Resample the data by month ('M') and calculate the median of 'Close' for each month
import plotly.express as px
# Resample the data by month ('M') and calculate the median of 'Close' for each month
monthly_data = bitcoin_crypto_concat_df.resample('M')['Close'].median().reset_index()
# Create an interactive figure using Plotly Express
fig = px.line(monthly_data, x='Date', y='Close', title='Cryptocurrency Price Over Time (Monthly Median)',
labels={'Close': 'Price (USD)'}, line_shape='linear')
# Customize the appearance of the plot
fig.update_traces(fill='tozeroy', fillcolor='rgba(255,165,0,0.5)', line=dict(color='orange'))
fig.update_xaxes(title_text='Date (Monthly)', showgrid=True)
fig.update_yaxes(showgrid=True)
# Rotate x-axis labels for better readability
fig.update_xaxes(tickangle=45)
# Show the interactive plot
fig.show()
Explanation of the 'Cryptocurrency Price Over Time (Monthly Median)':¶
The Goal: The primary goal of this analysis is to understand the general behavior of the cryptocurrency market as a whole. Utilizing the median as a metric can provide insights into the central tendency of prices across multiple currencies. This approach helps in smoothing out anomalies that are specific to any single cryptocurrency, offering a clearer view of the market's overall trends.
Chart Overview:
- The chart is a historical representation of the median monthly price of a list of cryptocurrencies.
- It highlights the volatile nature of cryptocurrency prices in the earlier years, characterized by significant fluctuations.
- In the later years, the chart shows a more stable pattern, indicating a possible maturation of the market or a decrease in volatility.
Why Use the Median:
- The median was chosen for this analysis to provide a more robust measure of central tendency.
- It is less affected by extreme values and outliers than the mean, making it a more reliable indicator in markets known for high price al currencies
# A closer look at a specific cryptocurrency, plot of Dogecoin over the years:
# Define the target cryptocurrency
target_crypto = 'Dogecoin'
# Filter the data for the specific cryptocurrency 'Dogecoin'
crypto_data = bitcoin_crypto_concat_df[bitcoin_crypto_concat_df['crypto_name'] == target_crypto]
# Create an interactive line plot using Plotly Express
fig = px.line(crypto_data, x='Timestamp', y='MarketCap',
labels={'Timestamp': 'Date', 'MarketCap': 'MarketCap'},
title=f'{target_crypto} MarketCap Prices Over Time',
line_shape='linear',
template='plotly_dark') # You can choose different templates for the plot's appearance
# Customize the appearance of the plot
fig.update_traces(line=dict(color='purple'), mode='lines')
fig.update_xaxes(showgrid=True, title_text='Date')
fig.update_yaxes(showgrid=True, title_text='MarketCap')
# Rotate x-axis labels for better readability
fig.update_xaxes(tickangle=45)
# Show the interactive plot
fig.show()
Analysis of Dogecoin's 'MarketCap Prices Over Time':¶
Sharp Rise in Late 2020 and Peak in 2021: The market capitalization of Dogecoin begins to rise sharply in late 2020, reaching a peak in 2021. This spike represents a dramatic increase in the value of Dogecoin, which could be attributed to a variety of factors, including increased investor interest, media attention,andrso ons.
Volatility in 2021: Following the initial peak, the market capitalization experidnces several sharp rises and falls, indicating a period of high volatility. This could be due to market reactions to news events, changes in the regulatory environment, or shifts in investor sentiment.
Decline and Fluctuations Post-2021: After the volatile period in 2021, there is a general decline in market capitalization, accompanied by some fluctuations. This trend might suggest a cooling off of the initial enthusiasm and a correction in the market value of Doe results.
# Finding the maximum variance in bitcoin_crypto_concat_df MarketCap per Year and Month.
# Extract year and month from the 'Date' column
bitcoin_crypto_concat_df['Year'] = bitcoin_crypto_concat_df['Timestamp'].dt.year
bitcoin_crypto_concat_df['Month'] = bitcoin_crypto_concat_df['Timestamp'].dt.month
# Calculate the variance of MarketCap for each year
annual_variance = bitcoin_crypto_concat_df.groupby('Year')['MarketCap'].var()
# Calculate the variance of MarketCap for each month (across all years)
monthly_variance = bitcoin_crypto_concat_df.groupby('Month')['MarketCap'].var()
# Find the year with the maximum variance
max_variance_year = annual_variance.idxmax()
# Find the month with the maximum variance
max_variance_month = monthly_variance.idxmax()
print(f"The period with the most variance in MarketCap is: Year {max_variance_year}, Month {max_variance_month}")
The period with the most variance in MarketCap is: Year 2021, Month 4
import matplotlib.dates as mdates
# Visualizing the maximum variance in bitcoin_crypto_concat_df MarketCap per Year and Month.
# Convert start and end dates to timezone-aware datetime objects
start_date = pd.to_datetime('2021-04-01').tz_localize('UTC')
end_date = pd.to_datetime('2021-04-30').tz_localize('UTC')
# Filter the DataFrame
mask = (bitcoin_crypto_concat_df['Timestamp'] >= start_date) & (bitcoin_crypto_concat_df['Timestamp'] <= end_date)
filtered_df = bitcoin_crypto_concat_df.loc[mask]
dates = filtered_df['Timestamp']
closing_prices = filtered_df['MarketCap']
# Create a line plot
plt.figure(figsize=(12, 6))
plt.plot(dates, closing_prices, label='Cryptocurrency MarketCap', color='orange')
# Formatting Date on the x-axis
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=1))
# Set the x-axis limits
plt.xlim(start_date, end_date)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Set labels and title
plt.xlabel('Date')
plt.ylabel('MarketCap')
plt.title('Cryptocurrency MarketCap April 2021')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Analysis of the 'Cryptocurrency MarketCap April 2021'¶
Significant Spikes:
- There are notable spikes in MarketCap, especially arouAprilber. These pronounced increases could be attributed to a sudden surge in the cryptocurrency's price, a dramatic rise in circulating supply, or a combination of both factors.
Period of Stability:
- The MarketCap shows periods of relative stability, where fluctuations are minimal. This suggests there were intervals of decreased activity or stable pricing during those times.
Analysis Insight:
- The line graph indicaApril November 2021 was a period of significant dynamism within the cryptocurrency markets, characterized by phases of high volatility and periods of stability. To fully understand the forces behind these market movements, one would need to consider additional information and external events not shown in the graph, such as market news, economic factors, or changes in investor sent. h.s.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
import ipywidgets as widgets
from IPython.display import display
# Define the plot_seasonal_decomposition function
def plot_seasonal_decomposition(period):
result = seasonal_decompose(bitcoin_crypto_concat_df['Open'], model='additive', period=period)
# Perform seasonal decomposition
result = seasonal_decompose(bitcoin_crypto_concat_df['Open'], model='additive', period=90)
# Plot the original time series, trend, seasonal, and residual components
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(result.observed, label='Observed', color='blue')
plt.legend(loc='upper left')
plt.title('Original Time Series')
plt.subplot(412)
plt.plot(result.trend, label='Trend', color='green')
plt.legend(loc='upper left')
plt.title('Trend Component')
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonal', color='red')
plt.legend(loc='upper left')
plt.title('Seasonal Component')
plt.subplot(414)
plt.plot(result.resid, label='Residual', color='purple')
plt.legend(loc='upper left')
plt.title('Residual Component')
plt.tight_layout()
plt.show()
# Create an interactive widget for selecting the period
period_slider = widgets.IntSlider(value=90, min=1, max=365, step=1, description='Seasonal Period:')
widgets.interactive(plot_seasonal_decomposition, period=period_slider)
interactive(children=(IntSlider(value=90, description='Seasonal Period:', max=365, min=1), Output()), _dom_cla…
The graph includes several plots, each representing different components of the data Sesaonal Period (90):¶
Blue Plot (Observed Data): This plot shows the actual observed data, which appears volatile and exhibits a pattern that repeats with some regularity. The cyclical nature of the blue plot suggests the presence of seasonality within the dataset.
Green Plot (Trend Component): The trend component is illustrated by the green plot. It appears relatively flat, which suggests that there is no strong overarching trend in the data over the observed time period.
Red Plot (Seasonal Component): The seasonal component is depicted by the red plot. It clearly shows regular peaks and troughs at consistent intervals, confirming the initial observation of seasonality in the data.
Purple Plot (Residual Component): The residual component is shown in purple. In an ideal decomposition model, this plot would show random noise with no discernible pattern, indicating that all seasonality and trend have been adequately captured by the model.
These components are derived from a time series decomposition, where the original signal is separated into these distinct elements to better understand and analyze the underlying patterns and structures inthe data.
bitcoin_crypto_concat_df.head()
| level_0 | level_1 | Open | High | Low | Close | Volume | MarketCap | crypto_name | Timestamp | Median | Year | Month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||
| 2011-12-31 | Bitcoin | 0 | 4.39 | 4.58 | 4.39 | 4.58 | 95.32 | 4.47 | Bitcoin | 2011-12-31 07:52:00+00:00 | 4.48 | 2011 | 12 |
| 2012-01-01 | Bitcoin | 1 | 4.58 | 5.00 | 4.58 | 5.00 | 21.60 | 4.81 | Bitcoin | 2012-01-01 04:16:00+00:00 | 4.79 | 2012 | 1 |
| 2012-01-02 | Bitcoin | 2 | 5.00 | 5.00 | 5.00 | 5.00 | 19.05 | 5.00 | Bitcoin | 2012-01-02 20:04:00+00:00 | 5.00 | 2012 | 1 |
| 2012-01-03 | Bitcoin | 3 | 5.32 | 5.32 | 5.14 | 5.29 | 88.04 | 5.25 | Bitcoin | 2012-01-03 11:45:00+00:00 | 5.30 | 2012 | 1 |
| 2012-01-04 | Bitcoin | 4 | 4.93 | 5.57 | 4.93 | 5.57 | 107.23 | 5.21 | Bitcoin | 2012-01-04 04:17:00+00:00 | 5.25 | 2012 | 1 |
import plotly.express as px
import pandas as pd
#
bitcoin_df = bitcoin_crypto_concat_df[bitcoin_crypto_concat_df['crypto_name'] == 'Bitcoin']
# Create an line graph
fig = px.line(bitcoin_df, x='Timestamp', y='Close', title='Bitcoin Closing Price Over Time')
# Personalize o layout do gráfico (opcional)
fig.update_xaxes(title_text='Data')
fig.update_yaxes(title_text='Bitcoin Closing Price (USD)')
# Create an interactive graph
fig.show()
Bitcoin Closing Price Over Time¶
Initial Stability: From the beginning of the chart until around 2017, Bitcoin's price was relatively stable, hovering around the lower end of the price spectrum.
Significant Increase: There's a notable increase in price beginning around 2017, where Bitcoin's value starts to rise sharply.
Peak Value: The graph shows that Bitcoin reached its peak value between late 2017 and early 2018, and then again around early 2021. These peaks appear to be the highest points on the graph, suggesting that Bitcoin reached its highest closing prices during these periods.
Volatility: After each peak, there is significant volatility, with sharp rises and falls in the closing price of Bitcoin.
Recent Trend: The most recent part of around 2022 tbly 2022, shows a downward trend, indicating a decrease in Bitcoin's closing price from previ
Prediction Without access to additional context, such as news events or market changes, this analysis is purely based on the visual trends presented in the graph. For a detailed financial analysis, historical market data and context would be essential.us highs.
Predictive Analysis Disclaimer¶
Please note that the analysis presented is based solely on the observable trendbased on historical price h and does not take into account additional market data, news events, or other external factors that could influencCryptocurrencyin's price. The insights offered here are for academic purposes within the scope of a postgraduate project and are not intended to predict actual cryptocurrency price movements. Due to the inherent volatility of cryptocurrency markets, any predictive models would necessitate a comprehensive approach that integrates a wide array of variables beyond just historical price trend